2 Web Scraping
2.1 Review
- APIs
- read_excel() content from Chapter 02
2.2 Introduction and Motivation
The Internet is an immense source of information for research. Sometimes we can easily download data of interest in an ideal format with the click of a download button or a single API call.
But it probably won’t be long until we need data that require many download button clicks. Or worse, we may want data from web pages that don’t have a download button.
Consider a few examples.
- The Urban Institute’s Boosting Upward Mobility from Poverty project programmatically downloaded 51 .xlsx workbooks when building the Upward Mobility Data Tables.
- We worked with the text of executive orders going back to the Clinton Administration when learning text analysis in DSPP1. Unfortunately, the Federal Register doesn’t publish a massive file of executive orders. So we iterated through websites for each executive order, scraped the text, and cleaned the data.
- The Urban Institute scraped course descriptions from Florida community colleges to understand opportunities for work-based learning.
We will explore two approaches for gathering information from the web.
- Iteratively downloading files: Sometimes websites contain useful information across many files that need to be separately downloaded. We will use code to download these files. Ultimately, these files can be combined into one larger data set for research.
- Scraping content from the body of websites: Sometimes useful information is stored as tables or lists in the body of websites. We will use code to scrape this information and then parse and clean the result. This is similar to using web APIs, but the information is not designed to be read by code.
Sometimes we download many PDF files using the first approach. A related method that we will not cover, but that is useful for gathering information from the web, is extracting text data from PDFs.
2.3 Legal and Ethical Considerations
It is important to consider the legal and ethical implications of any data collection. Collecting data from the web through methods like web scraping raises serious ethical and legal considerations. We will use these methods for good, not evil, and sometimes we will decide to not collect the data even when it would be useful. The Internet is full of scrapers and crawlers with nefarious intentions. We will not take our lead from the worst actors.
2.3.1 Legal1
Different countries have different laws that affect web scraping. The United States has different laws and legal interpretations than countries in Europe, which are largely regulated by the European Union. In general, the United States has more relaxed policies than Europe when it comes to gathering data from the web.
R for Data Science (2e) contains a clear and approachable rundown of legal considerations for gathering information from the web. We adopt their three-part standard of “public, non-personal, and factual,” which relates to terms of service, personally identifiable information, and copyright.
We will focus solely on laws in the United States.
Terms of Service
The legal environment for web scraping is in flux, but US Courts have created an environment that is legally supportive of gathering public information from the web.
First, we need to understand how many websites bar web scraping. Second, we need to understand when we can ignore these rules.
A terms of service is a list of rules posted by the provider of a website, web server, or software.
Terms of Service for many websites bar web scraping.
For example, LinkedIn’s Terms of Service says users agree to not “Develop, support or use software, devices, scripts, robots or any other means or processes (including crawlers, browser plugins and add-ons or any other technology) to scrape the Services or otherwise copy profiles and other data from the Services;”
This sounds like the end of web scraping, but as Wickham, Çetinkaya-Rundel, and Grolemund (2023) note, Terms of Service end up being a “legal land grab” for companies. It isn’t clear how LinkedIn would legally enforce this. HiQ Labs v. LinkedIn from the United States Court of Appeals for the Ninth Circuit bars Computer Fraud and Abuse Act (CFAA) claims against web scraping public information.2
We follow a simple guideline: it is acceptable to scrape information when we don’t need to create an account or click a box.
2.3.2 PII
Personally Identifiable Information (PII) is any information that can be used to directly identify an individual.
Public information on the Internet often contains PII, which raises legal and ethical challenges. We will discuss the ethics of PII later.
The legal considerations are trans-Atlantic. The General Data Protection Regulation (GDPR) is a European Union regulation about information privacy. It contains strict rules about the collection and storage of PII. It applies to almost everyone collecting data inside the EU. The GDPR is also extraterritorial, which means its rules can apply outside of the EU under certain circumstances like when an American company gathers information about EU individuals.
We will avoid gathering PII so we don’t need to navigate these rules.
Copyright
Copyright protection subsists, in accordance with this title, in original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. Works of authorship include the following categories:
- literary works;
- musical works, including any accompanying words;
- dramatic works, including any accompanying music;
- pantomimes and choreographic works;
- pictorial, graphic, and sculptural works;
- motion pictures and other audiovisual works;
- sound recordings; and
- architectural works.
In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.
Our final legal concern for gathering information from the Internet is copyright law. We have two main options for avoiding copyright limitations.
- We can avoid copyright protections by avoiding authorship in the protected categories (e.g., literary works and sound recordings). Fortunately, this includes most data, which Wickham, Çetinkaya-Rundel, and Grolemund (2023) call “facts”.
- We can scrape information that is fair use. This is important if we want to use images, films, music, or extended text as data.
We will focus on data that are not copyrighted.
2.3.3 Ethical
We now turn to ethical considerations and some of the best-practices for gathering information from the web. In general, we will aim to be polite, give credit, and respect individual information.
Be polite
It is expensive and time-consuming to host data on the web. Hosts experience a small burden every time we access a website. This burden is small but can quickly grow with repeated queries. Just like with web APIs, we want to pace the burden of our access to be polite.
Rate limiting is the intentional slowing of web traffic for a user or users.
If we use a custom function to pull information from the web, simply add Sys.sleep() to ease the burden on web hosts.
robots.txt tells web crawlers and scrapers which URLs the crawler is allowed to access on a website.
Many websites contain a robots.txt file. Consider examples from the Urban Institute and the White House.
We can manually look at the robots.txt. For example, just visit https://www.urban.org/robots.txt or https://www.whitehouse.gov/robots.txt. We can also use library(polite), which will automatically look at the robots.txt.
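As a minimal sketch, bow() from library(polite) introduces our scraper to a host, reads the site’s robots.txt, and records whether scraping is permitted and at what crawl delay. The user agent string below is a made-up example; identify yourself honestly in your own work.

```r
library(polite)

# bow() reads the site's robots.txt and returns a session object
# recording whether scraping is allowed and the requested crawl delay
session <- bow(
  url = "https://www.urban.org",
  user_agent = "DSPP student scraper"  # hypothetical identifier
)

session  # printing the session summarizes the robots.txt rules
```

Subsequent calls to scrape() on this session automatically respect the crawl delay, so politeness is enforced for us rather than remembered by us.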
Give Credit
Academia and the research profession undervalue the collection and curation of data. Generally speaking, no one gets tenure for constructing even the most important data sets. It is important to give credit for data accessed from the web. Ideally, add the citation to Zotero and then easily add it to your manuscript in Quarto.
Be sure to make it easy for others to cite data sets that you create. Include an example citation like IPUMS or create a DOI for your data.
The rise of generative AI models like GPT-3, Stable Diffusion, and DALL-E 2 makes considerations of giving credit urgent. These models consume massive amounts of training data, and it isn’t always clear where all of the training data come from or what the legal and ethical implications are.3
Consider a few current events:
- Sarah Silverman is suing OpenAI because she “never gave permission for OpenAI to ingest the digital version of her 2010 book to train its AI models, and it was likely stolen from a ‘shadow library’ of pirated works.”
- Somepalli et al. (2023) use state-of-the-art image retrieval models to find that image-generation models like the popular Stable Diffusion model “blatantly copy from their training data.” This is a major problem if the training data are copyrighted. The first page of their paper (here) contains some dramatic examples.
- Finally, this HBR article discusses the intellectual property problem facing generative AI.
Respect Individual Information
Data science methods should adhere to the same ethical standards as any research method. The social sciences have ethical norms about protecting privacy (discussed later) and informed consent.
Let’s consider an example. In 2016, researchers posted data about 70,000 OkCupid accounts.4 The data didn’t contain names, but it did contain usernames. The data also contained many sensitive variables including topics like sexual habits and politics.
The release drew strong reactions from some research ethicists including Michael Zimmer and Os Keyes.
Fellegi (1972) defines data privacy as the ability “to determine what information about ourselves we will share with others”. Maybe OkCupid users made the decision to forego confidentiality when they published their accounts. Many institutional ethics committees do not require informed consent for public data.
Ravn, Barnwell, and Barbosa Neves (2020) do a good job of trying to bridge the gap with a case study on Instagram.
It’s possible to conceive of a web scraping research project that is purely observational, adheres to the ethical standards of research, and still contains potentially disclosive information about individuals. Fortunately, researchers can typically use Institutional Review Boards and research ethicists to navigate these questions.
As a basic standard, we will avoid collecting PII and use anonymization techniques to limit the risk of re-identification.
We will also focus on applications where the host of information crudely shares the information. There are ample opportunities to create value by gathering information from government sources and converting it into more useful formats. For example, the government too often shares information in .xls and .xlsx files, clunky web interfaces, and PDFs.
2.4 Programmatically Downloading Data
The County Health Rankings & Roadmaps is a source of state and local information.
Suppose we are interested in Injury Deaths at the state level. We can click through the interface and download a .xlsx file for each state.
- Start here.
- Using the interface at the bottom of the page, we can navigate to the page for “Virginia.”
- Next, we can click “View State Data.”
- Next, we can click “Download Virginia data sets.”
That’s a lot of clicks to get here.
If we want to download “2023 Virginia Data”, we can typically right click on the link and select “Copy Link Address”. This should return one of the following two URLs:
https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx

https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx
If we plug that URL into a web browser it will automatically download the file. Alternatively, we can use download.file() to download the file provided we include a destfile.
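As a minimal sketch, assuming a data/ directory already exists, a single download.file() call fetches the workbook. The destination filename is a hypothetical choice; mode = "wb" ensures the binary .xlsx file is not corrupted on Windows.

```r
# URL copied from the County Health Rankings "2023 Virginia Data" link
url <- paste0(
  "https://www.countyhealthrankings.org/sites/default/files/media/document/",
  "2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx"
)

# destfile is required; mode = "wb" downloads the binary file intact
download.file(url, destfile = "data/virginia.xlsx", mode = "wb")
```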
If we poke around, we can see that all of the state data follows a common pattern. For example, the URL for Vermont is https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Vermont Data - v2.xlsx
The names only differ by "Virginia" and "Vermont". Now we can iterate downloading the pages. We will only download data for two states, but we can imagine downloading data for many states or many counties.
A couple of tips:

- paste0() and str_glue() from library(stringr) are useful for creating URLs and destination files.
- walk() from library(purrr) can iterate functions. It’s like map(), but we use it when we are interested in the side effect of a function.
- Sometimes data are messy and we want to be polite. Custom functions can help with rate limiting and cleaning data.

Bonus: read_csv() can directly read .csvs from the Internet. However, I still like to download the data because the Internet is a moving target.
library(purrr)

states <- c("Virginia", "Vermont")

urls <- paste0(
  "https://www.countyhealthrankings.org/sites/default/files/",
  "media/document/2023 County Health Rankings ",
  states,
  " Data - v2.xlsx"
)

output_files <- paste0("data/", states, ".xlsx")

# download one file, then pause briefly to ease the burden on the host
download_chr <- function(url, destfile) {
  download.file(url, destfile = destfile, mode = "wb")
  Sys.sleep(0.5)
}

# walk2() iterates over the URLs and destination files in parallel
walk2(.x = urls, .y = output_files, .f = download_chr)

We are not lawyers. This is not official legal advice. If in doubt, please contact a legal professional.↩︎
This blog and this blog support this statement. Again, we are not lawyers and the HiQ Labs v. LinkedIn decision is complicated because of its long history and conclusion in settlement.↩︎
The scale of crawling is so great that there is concern about models converging once all models use the same massive training data. Common Crawl is one example. This isn’t a major issue for image generation, but model homogeneity is a big concern in finance.↩︎
Every year, newspapers across the country FOIA information about government employees and publish their full names, job titles, and salaries.↩︎